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Abstract 

Non-negative matrix factorization (NMF) approximates a non-negative matrix X by a 
product of two non-negative low-rank factor matrices W and H . NMF and its extensions 
minimize either the Kuhback-Leibler divergence or the Euchdean distance between X and 
W^H to model the Poisson noise or the Gaussian noise. In practice, when the noise dis- 
tribution is heavy tailed, they cannot perform well. This paper presents Manhattan NMF 
(MahNMF) which minimizes the Manhattan distance between X and H for modeling 
the heavy tailed Laplacian noise. Similar to sparse and low-rank matrix decompositions, 
e.g. robust principal component analysis (RPCA) and GoDec, MahNMF robustly esti- 
mates the low-rank part and the sparse part of a non-negative matrix and thus performs 
effectively when data are contaminated by outliers. We extend MahNMF for various practi- 
cal applications by developing box-constrained MahNMF, manifold regularized MahNMF, 
group sparse MahNMF, elastic net inducing MahNMF, and symmetric MahNMF. 

The major contribution of this paper lies in two fast optimization algorithms for Mah- 
NMF and its extensions: the rank-one residual iteration (RRI) method and Nesterov's 
smoothing method. In particular, by approximating the residual matrix by the outer prod- 
uct of one row of W and one row of H in MahNMF, we develop an RRI method to iteratively 
update each variable of W and in a closed form solution. Although RRI is efficient for 
small scale MahNMF and some of its extensions, it is neither scalable to large scale matrices 
nor flexible enough to optimize all MahNMF extensions. Since the objective functions of 
MahNMF and its extensions are neither convex nor smooth, we apply Nesterov's smoothing 
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method to recursively optimize one factor matrix with another matrix fixed. By setting 
the smoothing parameter inversely proportional to the iteration number, we improve the 
approximation accuracy iteratively for both MahNMF and its extensions. 

We conduct experiments on both synthetic and real- world datasets, such as face images, 
natural scene images, surveillance videos and multi-model datasets, to show the efficiency 
of the proposed Nesterov's smoothing method-based algorithm for solving MahNMF and 
its variants, and the effectiveness of MahNMF and its variants, by comparing them with 
traditional NMF, RPCA, and GoDec. 

Keywords: Non- negative Matrix Factorization (NMF), Nesterov's Smoothing Method, 
Sparse and Low-rank Matrix Decomposition 



1. Introduction 



Non-negative matrix factorization (NMF) is a popular matrix factorization approach that 
approximates a non-negative matrix X by the product of two non-negative low-rank factor 
matrices W and H. Different to other matrix factorization approaches, NMF takes into ac- 
count the fact that most types of real-world data, particularly all images or videos, are non- 
negative and maintain such non-negativity constraints in factorization. This non-negativity 
constraint hel ps to learn parts-based represent a tion supported by psychologic al and phys- 



ical evidence ( Logothetis and Sheinberg . 19961 ) ( Wachsmuth and Orarn . 1994). Therefore , 



20071 ). face recognition (IZhang et al.l. l2008ll. video pr ocessing (jBucak and Gunsel . ,2007. ) . 



NMF achieves .reat suc cess in many fields such as ima.e an alysis kon.a a,nd Mihcafl; 



and environmental science (jPaatero and Tapped . 1 19941 ). 

NMF was first proposed by Paatero and Tapper (Paatero and Tapper . 19941 ) and was 
greatly popularized by Lee and Seung (|Lee and Seund . 119991 )7 Since then, many NMF 
variants have be en proposed and have achieved great success in a variety of tasks. For 
example, Hoyer (jHoverl . |2004| ) proposed sparseness constrained NMF (NMFsc) to enhance 
the sparseness of the learned factor matrices for computer vision tasks. Zafeiriou et al. 
( Zafeiriou et al. . 20061 ) propos ed discrirn i nant N MF (DNMF) to incorporate Fisher's criteria 
for classification. Cai et al. ( Cai et al. . 201 ll ) proposed graph regularized NMF (GNMF) 
to incorporat e the geometric structure of a d ataset for clustering. Recently, Sandler and 
Lindenbaum ( Sandler and Lindenbaum . 201 ll ) proposed an earth mover's distance metric- 
based NMF (EMD-NM F') to model the distortion of images for several vision tasks. Liu 
et al. (|Liu et al.l . l2012l ) proposed a constrained NMF (CNMF) to incorporate the label 
information as additional constraints for image representation. 

Prom the mathematical viewpoint, traditional NMF ( Lee and Seung . 19991 ) ( Lee and Seung . 

l200lh and its variants minimize the Kullback-Leibler divergence and the Euclidean distance 
between X and W^H to model the Poisson noise and Gaussian noise, respectively. Here, we 
call them KLNMF and EucNMF for short. Both KLNMF and EucNMF are popular because 
they can be efficiently optimized by using the multiplicative update rule (ILee and Seung . 
l200lh . However, the noise in many practical applications is heavy tailed, so it cannot be well 
modeled by either Poisson distributi on or Gaussia n distribution. For example, the gradient- 
based image features s uch as SIFT (jLowel . l2004l ^ contain non-Gaussian heavy tailed noise 
(Ijia and Darrelj . I2OI1I ). [n these cases, traditional NMF does not perform well because it 
is not robust to outliers such as occlusions, Laplace noise, and salt & pepper noise, whose 
distribution is heavy tailed. 
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On the other hand, real- world data often he s in a lower-dimensional subspace; for exam- 
ple, Basri and Jacobs (jBasri and Jacobsl . [iooi l showed that images taken from convex and 



Lambertian objects under distant illumination lie near an approximately ni ne-dimensiona l 



linear subspace. Recently, robus t priii cipal component analysis (RPCA, (ICandes et al. 



201 ih ) and GoDec dzhou and Taol . l2nilh have been proposed to robustly recover the lower- 



dimensional space in the presence of outliers. Both RPCA and GoDec consider the prior 
knowledge that noise, e.g., illumination/shadow in images and moving objects in videos, is 
sparse, and thus perform robustly in practice. Traditional NMF cannot robustly estimate 
the low-rank part of the data contaminated by outliers because it does not consider such 
prior knowledge of the sparse structure of noise. 

In this paper, we present Manhattan NMF (MahNMF) to robustly estimate the low- 
rank part and the sparse part of a non-negative matrix. MahNMF models the heavy tailed 
Laplacian noise by minimizing the Manhattan distance between an m x n-dimensional non- 
negative matrix X and W^H, i.e., 

min f(W,H) = \\X -W^H\\m, (1) 

W>0,H>0 ^ ' ^ II II™' ^ ^ 

where || • \\m is the Manhattan distance and the reduced dimensionality r satisfies that 
r <C min(m,n). Since both W and H are low-rank, MahNMF actually estimates the non- 
negative low-rank part, i.e., W^H, and the sparse part, i.e., X — W^H, of a non- negative 
matrix X. Benefiting from both the modeling capability of Laplace distribution to the heavy 
tailed behavior of noise and the robust recovery capability of the sparse and low-rank decom- 
position, such as RPCA and GoDec, MahNMF performs effectively and robustly when data 
are contaminated by outliers. We further extend MahNMF for various practical applica- 
tions by developing box-constrained MahNMF, manifold regularized MahNMF, and group 
sparse MahNMF. These extensions follow the regularization theory by integrating Mah- 
NMF with various regularizations. By taking into account the grouping effect of the sparse 
part, we develop the elastic net inducing MahNMF to learn the low-rank and group sparse 
decomposition of a non-negative matrix. Inspired by spectral clu stering, we develop a sym- 
metric MahNMF for image seg mentation. Although QH) tried to model Laplacian 



noise in NMF, it cannot be used in practice because the semi-definite programming-based 
optimization method used suffers from both slow convergence and non-scalable problems. 

The main contribution of this paper lies in two fast optimization methods for MahNMF 
and its extensions: the rank-one residual iteration (RRI) method and Nesterov's smoothing 
method. In particular, RRI approximates the residual matrix with the outer product of one 
row of W and one row of H in ^ and iteratively updates each variable of W and H in a 
closed form solution. RRI is efficient for optimizing small-scale MahNMF and some of its 
extensions, but it is neither scalable to large scale matrices nor flexible enough to optimize 
all MahNMF extensions. Since the objective functions of MahNMF and its extensions are 
neither convex nor smooth, we apply Nesterov's smoothing method to recursively optimize 
one factor matrix with another matrix flxed. By setting the smoothing parameter inversely 
proportional to the iteration number, we improve the approximation accuracy iteratively 
for both MahNMF and its extensions. 

We conduct experiments on both synthetic and real-world datasets, such as face im- 
ages, natural scene images, surveillance videos and multi-model datasets, to show the 



3 



N. GuAN, D. Tao, Z. Luo, and J. Shawe- Taylor 



efficiency of tlie proposed Nesterov's smootliing metfiod-based algoritlims for optimizing 
MaliNMF and its variants and their effectiveness in face recognition, image clustering, back- 
ground /illumination modeling, and multi-view learning by comparing them with traditional 
NMF, RPCA, and GoDec. 

The remainder of this paper is organized as follows: Section II presents the rank-one 
residual iteration (RRI) method for optimizing MahNMF, and Section III presents Nes- 
terov's smoothing method. Section IV presents several MahNMF extensions which can be 
solved by using the proposed Nesterov smoothing method-based algorithm. In Section V, 
we conduct experiments to show both the efficiency of Nesterov smoothing method-based 
algorithm for MahNMF and the effectiveness of MahNMF and its variants. Section VI 
concludes this paper. 

Notations: We denote by a lower headed x and a capital X a scalar, vector 

and matrix, respectively. In particular, 1 signifies a vector full of one, / signifies an identity 
matrix, and signifies zero, zero vector or null matrix. We denote by bracketed subscript 
and superscript the elements of a vector or a matrix, e.g., x^^) signifies the k-th element of x, 
and ^(fc), X^^\ signify the A;-th row, the k-th column, and the (z, j)-th element of X, 

respectively. We denote by a subscript, e.g., Xk and Xk the points in a sequence. We denote 
by R and R+ the set of real numbers and the set of non-negative real numbers, respectively. 
Consequently, R™ and Rl^*^" signify the set of m-dimensional non-negative vectors and the 
set of m X n-dimensional non-negative matrices, respectively. We denote by and ||x||z2 
the li and I2 norm of a vector x, respectively. For any matrices X G j^'^x'' and Y S R™^'', 
we denote their Euclidean distance (Frobenius norm) and Manhattan distance as — yHj? 
and \\X — Y\\m, respectively. In addition, we denote hy X oY and y their element-wise 
product and division, respectively. 

2. Rank-one Residual Iteration Method for MahNMF 

Since the objective function ([T]) is non-convex, we recursively optimize one factor matrix W 
or H with another fixed, i.e., at iteration t > 0, we update 

Ht+i = argminj:^>oll^ " W^^HWm, (2) 

and 

Wt+i = argmin^>oll-^^ " H^+iW\\m, (3) 

until convergence. The convergence is usually checked by the following objective-based 
stopping condition: 

\f{Wt,Ht)-f{Wt+uHt+i)\<^, (4) 

where ^ is the precision, e.g., ^ = .1. Because problems ([2|) and ([3]) are symmetric, we focus 
on optimizing ^ in the following section, and ([3|) can be solved in a similar way. 

Although (l2|) is convex, the Manhattan distance-based objective function, i.e., f(Wt, H), 
is non-differentiable when X — WJH contains zero elements. This means that the gradient- 
based method cannot be directly applied to optimizing ([2]). Fortunately, we will show that 
each variable in H can be updated in a closed form solution and thus ([2]) can be optimized 
by using alternating optimization over each variable of H. Given W"^ and rows of H except 
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-f^(i), eq. ([2]) can be written as 



mm \\Z -Wfi^H(i)\\M, (5) 
ti{i)>y> 



where Z = X — X][=i j^/ VF^"^-j-fr(j) is the residual matrix. Actuahy, Eq. ([5]) is a rank one 
approximation of the residual matrix. Therefore, following ( Ho et al. . 201 ll ). we term this 
method the rank-one residual iteration (RRI) method. 

Since dSJ is convex and separable with respect to each variable -ff(jj), wherein j G 
{1, ...,n}, there exists the optimal solution and -ff(ij) is updated as follows 

min - Wfi^H^ij^Wi = \W^i^i)H^i^j) - + ... + - 

-H(ij)>(J 

= Caa)(%.))- (6) 

Looking carefully at C,(i^j-^[H(i^j-^), it is a continuous piecewise linear function whose piecewise 

z 

points are P = {ps = w^^\W(i,is) 7^ Oj^s £ {l,...,m},s = g < m}. It is obvious 

that the minimum of C(/,j)(-f^(/j)) appears at one point of P. Regardless of the constraint 
H{ij) > 0, the point that first changes the sign of the slope of is its minimum. By 

sorting P in an ascending order, we have Pgi < • • • < p^c < • • • < Psi, wherein G {1, g}. 
Furthermore, by sorting {VF(;j^^),c = l,...,q} accordingly, we can remove the absolute 
operator in ^ and rewrite it into q + 1 pieces as follows 



(.-^m^i) - - - ^ii,is^))^ + ^(isid) + - + ^ ^ Ps^ 

i^m^i) + - + ^(/,*.,))^ - - - - Ps" < X 



Since > 0) the slope of the piecewise function in ^ is increasing. It is easy to find 

the point which first changes the sign of slope. Suppose first changes the signs of slope of 
C(ij)(x), i.e., W(^i^i^^)+...-W(^^^,)-...-W(^i^i^^^ < andWi^i^i^^) + ...+W(^^^,)-...-W(^^^^) > 0. 
It is clear that ps^ minimizes Note that the minimum is not unique because the 

slope at p^c may be zero. See Figure [1] for three examples of piecewise functions; it is clear 
that —1 and 1 minimizes /i and /2 and any point in the range [—1, 1] minimizes fs- Taking 
into account the non-negativity constraint, we obtain the solution of ([6|) as max{0,psc}. In 
the case of /s in Figured! we simply take the leftmost point as its optimal solution. 

We summarize the RRI method in Algorithm [TJ It successively updates each row of H 
and stops when the following stopping condition is satisfied 

\f{W,Hk+i)-f{W,Hk)\<e, (8) 

where e is the precision, e.g., e = .1. By recursively solving ([2|) and ([3]) with Algorithm^ 
the MahNMF problem ([1]) can be successfully solved. The previous variable is used as a 
warm start, i.e., Hq = , to accelerate convergence of the RRI method. 

The main time cost of Algorithm^is spent on sentence 6 that sorts the piecewise points. 
Its time complexity is 0{nmlogm) because the sorting operator for each piecewise point 
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X 

Figure 1: Piecewise function examples. The minimums of /i and /2 appear at —1 and 1, 
respectively, while the minimum of /a appears at the range of [—1, 1]. 



Algorithm 1 RRI Method for MahNMF 
Input: X G R![^^", W G R^^™, G R^^'^. 
Output: i/^+i. 

1: Initialize W'l = [W^i^i^), ...,Wii^i^^)], I G l,...,r, G l,...,m. 
2: For k = 0,1,2,... 
3: For / = 1, r 

4: Compute Z = X-EI=M^/W^(T)^fc». 

5: Compute Pj = {ps = ^^^\Wi^i,i,) / 0,Zs G {l,...,m},s = for j = 

6: Sort Pj simultaneously and sort according to P/s order, for j = 1, ...,n. 
7: Find the piecewise point that first changes the sign of Qi^) in ©• 
8: Update Hk+i[ij^ = max{0,p3Cj } simultaneously for j = 1, ...,n. 
9: End For 

10: Check the stopping condition ([8]). 

11: End For 

12: H^j^-^ = i^fc+i. 



set costs 0{mlogm) time in the worst case. Sentence 7 finds the piecewise point that first 
changes the sign of C{i,j){x) from the sorted set. This can be done by initializing the slope as 
— X^c=i ^ii,isc) and increasing it by 2W^r; j^^) at the c-th step. Once the sign of slope changes, 
the procedure stops and outputs . The worst case time complexity of this procedure is 
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0{mn). Therefore, the total complexity of Algorithm^is 0{mnr{[ogm + 1)) x where K 
is the iteration number. Since ^/gorii/im [1] finds the closed form solution for each variable 
of H, it converges fast. However, the time complexity is high especially when m is large. 
Therefore, RRI is not scalable for large scale problems. In the following section, we propose 
an efficient and scalable algorithm for optimizing MahNMF. 

3. Nesterov's Smoothing Method for MahNMF 

Since the Manhattan distance equals the summation of the li norm, i.e., \\X — W^" H\\m = 
Z]j=i ll^*'''^ ~ ^^^'^Wh^ minimization problem ([2]) can be solved by optimizing each 
column of H separately. For the j-th column, the sub-problem is 

min \\X^^^ -WfH^^^Wi^. (9) 
H(J)>0 

Without the non-r iegatiyity co nstraint, eq. ([9]) shrinks to the well-known least absolute 
deviations (LAD, dKarstl . [195^)) regression problem. Here, we term ^ as a non-negativ e 



LAD (NLAD) problem for the convenience of presentation. According to ( Harter . 19741 ) . 



LAD is much more robust than the least squares (LS) method especially on the datasets 
contaminated by outliers. NLAD inherits the robustness from LAD and keeps the non- 
negativity capability of datasets. 

For any given observation x = X^-^^ G R™, 1 < j < n and a matrix W = Wt £ R*^^*", 
the NLAD problem Q can be written as 

m 

mm{f{W,h) = \\W^h-x\\i, = < W^^,h>i -%| ■.heQi = R+}, (10) 

where < •, • >i signifies the inner product in R^'. Define the norm that endows the domain 
El = W as \\h\\i = \\h\\i^ = (X^j=i /iq)) ^ , and construct the prox-function for the feasible 

set Qi as di(h) = ^\\h\\1. It is obvious that di{-) is strongly convex and the convexity 
parameter is (5i = 1. Since Ei is a self-dual space, we know that the dual norm \\w\\\ = 
maXy-{< w, y >i: y G Ei, \\yi || = 1} = WwWi for any w £ Ei. 

Since f{W,h) is convex and continuous, there must be an optimal solution. However, 
it cannot be solved directly by using the gradi ent-based method because f{W, h) is non- 
smooth. Fortunately, Nesterov (|Nesterovl . V2m^ \ shows that f{W,h) 

can be approximated 



by a smooth function. In particular, we first construct a dual function for the primal non- 
smooth function and smooth the dual function by adding a smooth and strongly convex 
prox-function for the feasible set of the dual variable. Then we solve the smoothed dual 
function in the dual space and project the solution back to primal space. The obtained 
solution can be considered as an approximate minimum of the primal function. By choosing 
the dual domain E2 = R™' and the feasible set Q2 = {P- £ E2 : < 1,? = l,...,m}, 

wherein fl is the dual variable, the primal problem (jlOp is equivalently rewritten as 

min{/(VF, h) = max{< W'^h - x,fl >2- P e Q2} ■ h G Qi}, 
h 

where < •, • >2 is the inner product in R™. The corresponding dual problem is 
m&x{(f>{fi) = mm{< W^h — x,fl >2'- h G Qi} : fl G Q2}- 

u. u 
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Since v ~ Wh, Qi is bounded, i.e., there exists a positive number Mi such that < Mi 
for any h £ Qi. Then the dual function 0(/2) can be calculated explicitly, i.e., 4>{jl) =< 
W'^ifM-iiWfl) — x,fl >2, wherein V'Mi(') is an element-wise operator defined as ipMi{a) = 
0, a > 



Ml, a < 



Since it is difficult to estimate Mi , the dual problem is still difficult to solve. 



However, it can be easily solved by adding a simple prox-function. According to (iNesterov 



200i), we define the prox-function for Q2 as d2(/U = = ^(EI^i WW^'^Wif^fi))^ ■ % 

adding the prox-function, we obtain a smoothed approximate function for f{W, h) as follows 

fx{W, h) = max{< W'^h -x,fl>2 -Ad2(/u) : fl G Q2} 

m ^ m 

= maxj^ltyW'^/J - X(,))/i(,) - -A^ IIW^^^^II^^) : G Q2}, (11) 



i=l 



1=1 



where A > is a parameter that controls the smoothness. The larger the parameter A, 
the smoother the approximate function f\iyV,h) and the worse its approximate accuracy. 
Using algebra, eq. (fTTj) can be written as 



h{W,h)=ni^^{{W'h-xYil-i^n' Afl:\fl(^)\ < 1}, 

p, L 



(12) 



where 



A 



A||l^(i)||t 







••• AIIVF^'")!!^ 

Let p G R"* and G to be the Lagrange multiplier vectors corresponding to the 



constraints, i.e., — 1 < and — 1 < 0, respectively, the K.K.T. conditions of ((12 

are as follows 

Wh - X - Ap - p + ip = 
P(i) > 0, > 

- 1 < 0, - 1 < 

I i-Pii) - ^)p(i) = 0, {p{i) - l)<^(i) = 
From ()13p . we can easily obtain the closed-form solution of (|12p as 



(13) 



/7( ■) = med{ 



Ij — Ij 7: 7^: I, i = 1, m, 

A||W^W||^ ^ 



(14) 



where med{-) is the median operator. By substituting fl* back into (fT2]) . we obtain the 
closed- form smoothed function f\{W,h) as 



... \W^'^^h-X(i) 



fx{w,h) = Y,\\w^'^riM 



i=l 



\\w(^)\\i 



(15) 



where V'A (''") 



fX' 0<r<A 



r- I, r > A 



. According to Theorem 1 in (|Nesterovl . [2nn3 ). fxiW, h) 



is well defined and continuously differentiable at any h £ Ei. Moreover, fx{W, h) is convex 
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and its gradient Vfx{W, h) = WfT is Lipschitz continuous with constant Lx = aS^||W^"^|Ii,2' 
wherein ^2 = 1 is the convexity parameter of d2{-) and ||Ty^||i,2 is the norm of projection 
matrix W which is defined as fohows 

m 

\\W^\\i,2 = max{^/2(,) < W^'\h>i: \\h\\i < 1, ||^||2 < 1} 

h,ll 

mm m 

< max{^ ir«iri^(.) : WW^'^Wlf^l) < 1} = E H^^^^Hi]^ = 

1=1 1=1 i=l 

By using the obtained smoothed function, ([9]) can be approximately solved by 

= avgmin^^^Jx{Wt,h). (16) 

Since fx(W,h) is smooth, convex and its gradient is Lipschitz continuous, it natu- 
rally motivates us to optimize ()16p by using Nesterov's optimal gradient method (OGM, 
(jNesterovl . 12004 )). In particular, OGM constructs two auxiliary sequences in optimization: 
one sequence stores the historical gradients and another sequence stores the search point 
that minimizes the quadratic approximation of fx(W,h) at the current solution. The step 
size is determined by the Lipchitz constant. In each iteration round, the solution is up- 
dated by combining the historical gradients and search point. This combination accelerates 
the gradient method and makes OGM achieve an optimal convergence rate of O(^) for 
optimizing (fT6l) . In this paper, the search points {ijk} and the historical gradients {zk} are 
defined as follows: 

Vk = argmin^gQj< Vfx{W,hk),y- hk >i +^\\y- HWl}, (17) 

and 

Zk = argmin,-.gQj^di(z) + Y'-^[fxiW,hi)+ < VfxiW,h,),z- hi >i]}, (18) 

^ i=0 

where fc > is the iteration counter. By solving (jl7p and (|18p . respectively, we have 

Vk = max(0, hk - -^VfxiW, hk)), (19) 

and 

4 = max(0,-— ^^V/a(W,/?0)- (20) 

^ 1=0 

According to ( Nesterov . 20041 ) . we combine yk and Zk as follows: 

K + 3 A; + 3 

By alternating between p9p . (j20p and (|2ip until convergence, we obtain the final solution 
h*x of (|16p . The convergence is checked by using the following objective-based stopping 
condition: 

\fx{W,hk)- fx{W,hl)\<e, (22) 
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where e is the precision, e.g., e = . 1. Since hX is u nknown in ahead, we usually use /i/t+i 
instead. According to Theorem 3 of (|Nesterovl . 12004 ) . the complexity of finding an e-solution 

|VK^||i.2 /d7d7 , r, / MD 



does not exceed N 



D1D2 I 9 
S1S2 ^ ^ 



-. By substituting b\ = ^2 = 1, M = 0, 



D2 = y, we have N = "UhlinL^ namely OGM finds an e-solution for ([9]) in 0{\) iterations. 

As mentioned above, the smooth parameter A controls the approximation of fxiW^h) 
for fiW^h), smaller A implies better approximation. A natural question is whether h\ 
minimizes ([9]) as A goes to zero. To answer this question, we first show that f(yV,h) is 
bounded and gets infinitely close to f\iyV^K) as A goes to zero in the following Theorem 
1 and we prove that the smoothing method finds an approximate solution of MahNMF in 
Theorem 2. Figure [2] gives two examples of the smoothing functions with different smooth 
parameters. It shows that the original non-smooth function is bounded. 





Figure 2: Two examples of smoothing function of the absolute function f when (a) A 
and (b) A = .05 with 6 = 10. 



Theorem 1 Given any positive number A > 0, we have the following inequality: 

D 



fx{W, h) < f{W, h) < h{W, h) + -A. 



Proof. Defining the residual error e = W^h — x, then its i-th entry is e(j) = W^'^^'^h — X(^iy 
The approximation function f\(W, h) can be written as the following function with respect 
to e: 



(23) 



10 



MahNMF: Manhattan Non-negative Matrix Factorization 



Below we will prove that \\w(^^ WlMj^wmf) < l%)l < \\W^'^\\lMj0^H For 
the convenience of derivation, we focus on the following function g\fi{x) = Otpx^^), wherein 
X > 0. According to the definition of V'a(') in we have 

2 ' — 

It is obvious that g\fi{x) < x < g\fi{x) + By substituting these inequalities into (plj) 
and considering D = we have fx{W,h) < f{W,h) < fx{W,h) + f A. This 

completes the proof. ■ 
Since the columns of H are separable, OGM can be written in a matrix form and 

\\wW\\l 

summarized in Algorithm [21 wherein Q 



X and Lx = ^jW^Wl 



1,2- 



Algorithm [2] accepts the smooth parameter A as an input and outputs an approximate 
solution of the sub-problem ([2]). 

Algorithm 2 OGM for Smoothed NLAD 

Input: X G R![^^", W G R^^"^, Ho G R^^", A, e. 
Output: H;^^. 



Initialize Q, Lx- 
For k = 0,1,2,... 

Compute Uk = med{l, -1, ^^^^~^ }. 
Compute Vf{W, Hk) = WUk- 
Compute Yk = max{0,Hk - ^Vfx{W,Hk)). 



1 

2 
3 
4 
5 

6 
7 
8 
9 

10: H^_^_^ = Hk+i 



Compute Zk = max(0, Eto '-^^ h{W,Hk)). 
Update //fc+i = ^Zfc + ^Yk. 
Check the stopping condition ([8|). 
End For 



According to ( Nesterov . 20041 ) . ^Z^orzi/im [2] converges at the rate of O(p-) for optimizing 



(|16p and needs O(^) iterations to yield an e-solution of the original problem ([9]). Since the 
distance between the primal and dual functions is 

4||H'^||?,Di 

° S «A) < AD. + ^^^^Jl < (25) 

where Di = max^{di(/I) : h £ Qi}, and fi = Xlilo (N+i)(N+2) f^^(^i)^ flxih) is the 
solution of (|2ip at the i-th iteration rounds. By minimizing the right-hand side of the 



above inequality, we have ^ = and -|- 1 < 4||l^-^||i^2 y ^T^- Since x ^ W^h, Di is 
bounded. However, this bound is difficult to calculate exactly. In the following section, we 
will show that this deficiency can be overcome by slightly modifying the feasible set Qi. 
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According to (|Nesterovl . |2004| ). sentence 5 of Algorithm [2] can be slightly changed to 
guarantee decreasing the objective function. In particular, we find = max(0, Hk — 
^Vfx{W,Hk)) and set ^ = aTgmmY{fx{W,Y),Y G {Yk_i,HkX}}- This strategy re- 
quires additional computation of the objective function and thus increases the time cost 
of each iteration by 0{mn). The main time cost of Algorithm [2] is spent on sentences 3 
and 4 to calculate the gradient, whose complexity is 0{mnr). Therefore, the total time 
complexity of Algorithm^is 0{2mn(r + 1)) x K, wherein K is the iteration number. 

According to Theorem 1, h\ gets infinitely close to the minimum of (jlOp . i.e., /i*, as 
A goes to zero. This motivates us to adaptively decrease the smooth parameter during 
each call of Algorithm^ The total procedure is summarized in Algorithm^which. sets the 
smoothing parameter inversely proportional to the iteration number and thus improves the 
approximation iteratively. In Algorithm^ the current solution, i.e., Ht and Wt^ is used as 
a warm start to accelerate the convergence of Algorithm^ {see sentence 3 and 4). 



Algorithm 3 Smoothing Method for MahNMF 
Input: X £ R![^^", A, <e, Aq. 

Output: W^:, H^:. 

1: Initialize Wq > 0, Hq > 0. 
2: For k = 0, 1,2, ... 

3: Solve Ht+i by Algorithm^ with input {X,Wt, Ht, t, ^t^)- 
4: Solve Wt+i by Algorithm M'^ith input (X, Ht+i,Wt,t, ef)- 
5: Update A* = 

6: Check the stopping condition Q. 
7: End For 

8: W, = Wt+i, = Ht+i. 



^/(/orzi/imOrecursively minimizes two smoothed objective functions, i.e., {Wt-, H) and 
f\t(W, Ht+i), as t goes to infinity. Although the generated point {{Wt+i, Ht+i)} at the t-th 
iteration is not the minimum of the original sub-problems, the following Theorem 2 shows 
that {{Wt+i, Ht+i)} do decrease the objective function f{W,H) as t goes to infinity. Since 
the objective function f{W,H) is lower bounded, ^/(^orit/im [3] converges to an approximate 
solution of ([T]). 

Theorem 2 The sequence (Wj+i, //j+i) generated by Algorithm ^decreases the objective 
function of MahNMF, i.e., for any t > 0, f{Wt+i,Ht+i) < f{Wt,Ht) 

Proof. Without loss of generality, we take the i-th iteration round for example and show 
that Algorithm [3] decreases the objective function. Given Wt, the sentence 3 of Algo- 
rzi/im [3] implies that Ht+i = Siigm.\nH>o f(^t){Wt-,H). Therefore, we have f\^{Wt,Ht+i) < 
f\^{Wt,Ht). According to Theorem 1, we have 

n n „ 

f{Wt, Ht+i) = fiWt, Hi,) < ^ {Wt, Hl^,) + ^At = h,{Wt, Ht+i) + ^A*. 
i=i i=i 

and fx^{Wt, Ht) < f{Wt,Ht). Then we immediately have the following inequalities 

n D 

f{Wt,Ht) > fxAWt,Ht) > fxAWt,Ht+i) > f{Wt,Ht+i) - — A*. (26) 
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From m, we get that f{Wt,Ht) - f{Wt,Ht+i) + ^Xt > fM,Ht) - fxAWt,Ht+i). 
By setting the precision for Algorithm^as ef < and using the objective-based stop- 
ping condition we have fx,{Wt,Ht) - fxAWt,Ht+i) > ^Xt, and thus f{Wt,Ht) > 
f{Wt,Ht-f^i). In a similar way, we can prove that f{Wt,Ht+i) > f {Wt+i , Ht+i) . Therefore, 
f lWt,Ht) > f{Wt+i,Ht+i). This completes the proof. ■ 

In the proof of Theorem 2, we need to set the precision of Algorithm [2] as < at 
the t-th iteration round. By substituting Xt = we have < 2(^+1) ■ Therefore, e^^ may 
go to zero as t goes to infinity and this setting will make Algorithm^Ry without stopping. 
Fortunately, since ^/^orit/im [2] converges at the rate of 0(p-, it needs only 0{y/t) iterations 
to reach precision which is quite cheap. For example, suppose Algorithm [3] converges 
within T < 10^ iterations in its worst case. Algorithm^ needs only around 100 iterations 
to reach precision when t <T. This means that such an assumption is usually satisfied. 
Therefore, ^Z^orit/im [3] obtains an approximate solution for MahNMF when it stops. 

The main time cost of Algorithm [3] is spent on sentences 3 and 4 which call Algorithm 
[2] to successively update H and W, respectively. Since the time complexity of Algorithm^ 
is 0{2mn{r + 1)) x K and the iteration number K depends on the specified precision, the 
time cost of the t-th iteration of Algorithm^is 0{2mn{r + l)\/t). Therefore, the total time 
complexity of Algorithm [3] is 0{2mn{r + 1) Ylii=i V^)' wherein T is the iteration number. 
Empirically, T is small, e.g., T < 100, and thus A Z^orzi/im [3] converges fast. 

In summary, the proposed Nesterov smoothing method-based algorithm costs less CPU 
time in each iteration than the proposed RRI method whose time complexity is 0{mnrlogm+ 
mnr) x K, wherein K are the iteration numbers of Algorithm [TJ but the RRI method con- 
verges in fewer iteration rounds because it finds a closed form solution for each variable of 
both factor matrices. Therefore, the performance of the proposed RRI algorithm and the 
smoothing-based algorithm is comparable for optimizing small scale MahNMF. However, 
the smoothing method is much more scalable than RRI due to its lower time complexity. 
Thus we suggest choosing the Nesterov smoothing method-based algorithm to optimizing 
MahNMF and its variants. 



4. MahNMF Extensions 

MahNMF provides a flexible framework for developing various algorithms for practical ap- 
plications. In this section, we extend MahNMF by integrating box-constraint, manifold 
regularization, and group sparsity, and develop elastic net inducing MahNMF and symmet- 
ric MahNMF for several computer vision tasks. 



4.1 Box-Constrained MahNMF 

When the observations satisfy a box constraint such as < < 1 for any 1 < i < m, it 
is reasonable to assume that the entries of W and H fall into the domain [0, 1]. Based on 
this observation, we extend MahNMF to a box-constrained MahNMF (MahNMF-5C) as 
follows 

n<w <Tn"<^ f^''\W,H) = \\X-W''H\\i,. (27) 

It is natural to adopt the proposed Nesterov smoothing method to optimize (I27p . It 
is surprising that Algorithm ^hecomes much more efficient in this case. In particular, we 
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replace the feasible set of the box-constrained non-negative least absolute deviation (NLAD- 
BC) problem with Qf'^^ = {h : < h(^j^ < 1} and keep all the other definitions consistent. 
Instead of ()17p and (llSh . we solve the following two problems for the two auxiliary sequences: 

ifk"^ = T^fc){hk) = argmin^^^(bc){< Vfx{hk),y-hk >i +^\\y - hk\\l}, (28) 



and 



k 

Jibe) _ . r ^ / I ^ + ^ 



^k = ar: 



gmin_,^^(,.){-Acii(z) + Y^—-[f^{hi)+ < fx{hi),z- hi >i]}, (29) 
^ ^ i=0 
whose solutions are as follows: 

yi'^) = medio, r, hk - ^V/a(4)), (30) 

and 

4"'^) = medio, 1, ^ 1±1v/a(K,)). (31) 

By using the box constraint, it is quite easy to compute the bound of the prox-function for 
Q^i^\ i.e., d[^^^ = |. Based on the obtained bound, it is easy to compute the dual function, 
i.e., 

(^) = min{< W^h-x,fl>i: h G Q^^^} =< W^^iiWp) -x,fl>i. 

h 

Thanks to the closed-form dual function (p^^^\jl), eq. (f25l) can be used to check the con- 
vergence of Algorithm [2j It greatly cuts down the time cost of Algorithm [2] because the 
calculation of objective function fxiW,h) is withdrawn. 

The RRI method can also be naturally adopted to optimize MahNMF-5C because the 
only difference between MahNMF and MahNMF-5C is on their feasible sets. In particular, 
we keep all the other parts of Algorithm^consistent except sentence 8. After obtaining the 
piecewise point p^c^ , the closed form solution for sentence 8 for each variable in MahNMF- 
BC is replaced by Hk+i^i = med{0, IjPg^j } for any I G {1, r} and j G {1, n}. 

4.2 Manifold Regularized MahNMF 

When the observations distributed on the surface of a manifold are embedded in a high- 
dimensional space, one is interested in preser ving the geometry struct ure in the learned low- 
dimensional space. Manifold regularization ( Tenenbaum et al. . 200d ) aims to preserve this 
geometry structure and constructs an adjacent graph G to capture the neighbor relationship 
between one observation and a few of its nearest neighbors. By minimizing the distances 
between each observation and its corresponding nearest neighbors in the low-dimensional 
space, it preserves the geometry structure, i.e., 

mintriHL^^'iH'^), (32) 

where L^*^) is the Laplacian matrix of G. By combining (j32p and ([T|), we extend MahNMF 
to a manifold regularized MahNMF (MahNMF-M), i.e., 

rnin /W(l^,i7) = \\X - W^^ H\\m + ^triHL^^^ H^), (33) 

W>0,H>0 Z 
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where /3 > is the tradeoff parameter. 

MahNMF-M can be solved by using alternating optimization over W and H with Algo 
rithm [2] and slightly modified Algorithm [21 respectively. We term the optimi zation proce- 



dure oi H a manifold regularized NLAD (NLAD-M) problem. According to (jGuan et al 



20]J), the second term of f'^^'^\W,H) is convex and its gradient is Lipschitz continuous 
with constant L^'-'\ Therefore, to solve NLAD-M, sentences 5 and 6 in ^/(/orit/im [2] should 
be replaced by 

ri^) = max(0,i?, - -±^Vfi''\H,)), (34) 



and 



Zr = max(0, ' ^V/f )(//,)), (35) 



L 



(M) 

A j=0 



where V f[^\Hk) = + (^HkL^^^ and ^ =L^ + (3L(G). 

In addition, the proposed RRI method can also be adopted to optimize MahNMF-M. By 
using the residual matrix Z defined in (0) and considering the l-th row of H, the objective 
function ([33]) can be equivalently rewritten as 



min /W(VF, //(,)) = \\Z - W^(T)i?(0 lU/ + ^H^i)L^^^ H^iy (36) 



Given all the variables in except Hf^ij-^, eq. (j36p is equivalent to 

m „ n 



where Saj > is the (a,j)-th element of the similarity matrix for adjacent graph G. Since 
f(^^^(W, Hfj^ j^) is actually a continuous, convex, and piecewise function, we can easily obtain 
its closed form solution based on the following Theorem 3. Supposing the minimum of 
/(^)(VF,i/(ij)) is H'f^^.y the optimal solution of ^ is H*^i .-^ = max(0, ^.p. Note that 

H'^i is selected from the piecewise point set {^^^i = l,...,m} according to Theorem 3. 



If Z contains all zero, then the optimal solutions of ([36]) will be trivial, i.e., H'^i^ = 0. To 
overcome this problem, we need to initialize both W and H hy a small value, e.g., 10~^^. 
In our experiment, this initialization strategy works well. 

Theorem 3 Given f{x) = Yl^i kj(a; — Xi)\ + b{x — d)^, wherein > and b > and 
xi < • • • < Xm- Define /cj+i = ki + 2aj and ki = — oi — • • • — a^- If 2b{xi — d) + ki^i < 
and 2b{xi+i — d) + ki+i > for a some i, then the minimum of f{x) is 



d-^„ i = 
d-%±i, i = m 



maxjxj, d i^"}; ^ ^ {li "t- ~ !}• 
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Proof. Using algebra, /(x) can be written as a piecewise quadratic function as follows: 



fix) 



h{x — d)^ + kix + ci, xq < X < xi 

h{x — d)"^ + k2X + C2, X\ < X < X2 

h{x (i)^ -\~ kriYiX -\- Cyji^ XtYi—\ ^ X ^ X' 
b{x - df + km+ix + c, 



-m+ 1 ; 



•m 

X-m ^ X <C Xjn+l 



m+1 



(38) 



oo 



where ci = aixi + - • •+amXm and Cj+i = Ci — 2aiXi. Here we define xq = — oo and x 
for the simplicity of presentation. Since the first part of f{x) is convex and the second part 
is strongly convex, /(x) is totally strongly convex, and thus it has an unique minimum x^,. 
According to (p8]) . we obtain the slope of /(x) as follows: 

2b{x — d) + ki, Xq < X < xi 
2h{x — d) + k2, xi < X < X2 



fix) 



(39) 



26(x d) + kmj Xfn—l ^ X < XjYi 
^ 2b{x - d) + km+l, Xm<X< Xm+1 



Here we define / (xq) = — oo and / (xm+i) = oo for the convenience of derivation. It 
is obvious that / (x) is non-continuous; we define the left slope and right slope at each 
piecewise point Xi as /_(xi) = 2b{xi — d) + ki with i E {!,..., m + 1}, and /+(xj) = 
2h{xi — d) + fei+i with i G {0, ...,m}, respectively. Since f{x) is continuous and strongly 
convex, x* is unique and it appears at the point that first changes the sign of / (x). Suppose 
/+(xj) < and /_,_(xi+i) > 0, wherein i E 0, ...,m, we have Xj < x^, < Xj+i. If z = 0, we 
have x=K = d—^ because /(x) is a quadratic function on the set (xj, Xj+i]. If i = m, we have 
x^ = d — because /(x) is a quadratic function on the set [xj, Xj-|_i ). If i G {1, ...,m-l}, 
/(x) is a quadratic function on the set [xj,Xi+i], we have x=k = me(i{xj, Xj+i, d — ^j^}- 
Since /^(xj+i) > 0, we have d — -^^^ < Xj+i, then x* = max{xj,d — ^|^}- It completes 
the proof. ■ 
Although RRI can be applied to the optimization of MahNMF-M, it is time-consuming 
because the variables must be updated one by one. We suggest the proposed Nesterov 
smoothing method for optimizing MahNMF-M 



4.3 Group Sparse MahNMF 

Since NMF does not explicit ly guarantee sparse representation, Hoyer proposed sparseness- 
constrained NMF (NMFsc, (jHoveii . 12004 )) to incorporate sparseness constraint on single or 
both fa ctor matrices. Rece n t results show that many data sets are inherently structured as 
groups teengio et al. . 20091 ) ( Huang et all . 20091 ). i.e., some of the data items or features that 
belong to the same group share the same sparsity pattern. For example, different types of 
features such as pixels, gradient-based features, and color-based features of an image can be 
considered as different groups. Such prior knowledge of group sparsity greatly improves the 
effectiveness of sparse repres entation and has be successfully applied in many methods, e.g., 
group Lasso (jYuan and Linl . 12006 ) . It motivates us to introduce group sparsity to explicitly 
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improve the sparse representation of MahNMF. The objective of group sparse MahNMF 
(MahNMF- G^) is as follows: 



min \\X - W''H\\M,s.t.,yp G Gw, \\W^'^^\\i,p < ^w,yp G Gh, ll^'^'^Hi.p < 7H, (40) 



or 

unn \\X-W^H\\M + m ^ W^'^'^Kv + m E (41) 

W>\},H>0 '■ — ' — ' 

where X'^l signifies the columns of X indexed by group /?, and Gw C 2ii'-''"> and Gh C 
2{i. ■■■,"-} are the grouping sets of columns of W and -ff, and 7vk and control the group 
sparsity of W and -ff, respectively. The tradeoff parameters r/^y > and 'qn > control the 
group sparsity over W and if, respectively. The group sparsity is usually defined by using 
Li^p-norm which is defined as = X^j=i I ll-f^'^"'^^ llpl for any X G R"^'', wherein p > 1. 

Usually, we choose p = 2, oo for group sparsity. In the following section, we will show that 
both (jlUj) and (|^T]) can be solved by slightly modifying the proposed Nesterov's smoothing 
method. 

To solve (|40|) . we modified Algorithm^hj redefining the feasible set of W and H as 

gjf.^) = {W^ R^x'^V/, G Gw. W^'^^h.v < lw,Gw C 2^1'-'"} (42) 

and 

QSf^) = {HG R;^"Vp G Gh, II^'^'^IIlp < 7h, Gh C 2^''-'^\ (43) 

Based on the alternating optimization method, given W, H can be optimized by cycling on 
variables indexed by Gh because the groups are non-overlapping. For any group p G Gh, 
the objective for optimizing H^^^ is 

min \\X^P^ -W'^H^p^Im, (44) 

(as) 
hIp] 



where Qj^'j, = {//W G R+''"''|||-H'W ||i,p < jh}- It is obvious that Q^^'j, is closed and 
convex set, and thus (I44p can be solved by slightly modifying Algorithm [2j Particularly, 
at the k-th iteration, the sequences and can be obtained by solving the following 
problems: 

yl"! = argmin^; {< /.(H^,^^),^^ - >i - ^^'HM, (45) 

and 



zlfU arg min^,,,Q^^,,)(,., llZl^'l III + J] i±i [A(W^, 



1=0 



+ < Vh{W,Hl''^),Zlf^ - >i]}, (46) 

respectively. Both (|45p and (j46p essentially minimize a quadratic function over a convex 
set, and thus they can be solved by projecting the minimum of the corresponding quadratic 
functions as follows: 

^i"' = n (4'' - ^^hiW, hP)), (47) 
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and 

k 

= n (-^E^WA(w^,i^?^)), (48) 



where nntf") (-^^ projects X onto <5^|*p]- The projection operator can be defined as 



min -\\X-X\\l. (49) 

^>0,||X||i,p<7H 2 



According to (jTandon and Sral . the non-zero entries in the optimal solution, namely 



X^K, of (4.14) share the same signs as those in X. Therefore, eq. (j49p can be solved by 
projecting the absolute of X onto the h^p ball, i.e., 

^* =pro^^(max(0,X)). (50) 

When p = 2, the projection can be done by using Berg's algorithm ( Berg et al. . 20081 ) in 



O(rnp) time. When p = oo, the projection is completed by using Quattoni's algorithm 
( Quattoni et al. . 20091 ) in 0(rnp log n^) time. Therefore, the proposed Nesterov smoothing 



method-based algorithm can be applied to optimizing (j40p without increasing the time com- 
plexity. Moreover, the following section will show that it can also be adopted to optimizing 

Although the objective function of (I4ip is non-convex with respect to W and H simul- 
taneously, it is convex with respect to either W or H. Therefore, eq. (j4ip can be solved by 
alternatively optimizing W and H. Take the sub-problem of optimizing H (called group 
sparse NLAD or NLAD-GiS" for short) for example, its objective function is as follows: 

mm\\X -W'^HWm + Vh Y1 (^1) 

Since both \\X — W'^H\\m and YlpeGn ll-^'''''^lli>p convex, eq. (|5T]) has an optimal 

solution. However, it is involved because neither ||X — W'^H\\m nor X^^gj^^ ||ff[''l"^||i^p is 
smooth. Fortunately, the OGM method (see Algorithm^ can be slightly modified to solve 
it efficiently. By using the smoothing function fx{W,H), eq. (f5T|l can be approximated by 

mmhiW,H)+rjH ^ (52) 

Since Gh is non-overlapping, fx{W,H) is separable, i.e., fx{W,H) = Z^pgCn 
Eq. ()52|) can be solved by recursively optimizing each group of variables, i.e., 

min h{W, H^P^) + r/^^||i/W^||i,p, (53) 

_H'W>0 



where p € Gr- In order to solve (j53l) by using the OGM method, we construct additional two 
auxiliary sequences, i.e., vjf^ and Z^^\ wherein /c > is the iteration counter. Since OGM 
essentially constructs the 'Y' sequence by optimizing a linear approximation of f\iyV, •) at 
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jjM regularized by a quadratic proximal term, we propose to construct Y^'^^ by optimizing 
the following objective function 

yI"^ = argmin^M,Qj< V/A(Ty, 4"^), - >i +^\\Y^^^ - hI'^WI + m\\Y^'^^ \\i,p]}- 

(54) 

Because (j54p employs no approximation on the non-smooth part, such approximation will 
not decrease the convergence rate of Algorithm [2] if (154^ can be efficiently solved. Fortu- 
nately, the answer is positive. Using algebra, eq. (j54p can be equivalently rewritten as 



yI'^ = argmin^MegJ^r^''^ " (4"' " fx{W, Hi'^ml 



^ ■— " fW^ii? 



2L 



= argmin^[,i,Qji||yW - {hI'^ - -1-V/a(H^, + ^jY^'^^h,,}. (55) 

According to (jTandon and Sra . l20ld l. eq. (|55p reduces to the well-known proximity 
operator problem, i.e., 

y^^'l =argmin^[,l{i||y[''l -n(4''^ " ^V/a(H^, F^))!!? + ^llyW'^lli.p}. (56) 

This problem can be solved by first solving its dual problem and projecting the solution 
back to solve the primal problem. When p = 2, the dual problem of (j56p can be easily 
solved by normalization. When p = oo, its dual problem is equivalent to ^i-norm projecti on 
that can be efficiently solved by using Duchi's method in linear time ( Duchi et al. . 20081 ). 

Since OGM constructs the 'Z' sequence by optimizing a combination of the linear ap- 
proximations of fx{W, •) at the historical search points regularized by a quadratic term, 
similar to (I54p , we can construct Z^f^ by adding the non-smooth part to each linear approx- 
imation, i.e., 

k 

Zf = argmin^„i{:^||zW|||. +^i±i(A(W^, //]"])+ < V fxiW^Hf^), Z^^^ - H^^ >, 

i=0 



+ Vh\\Z^''^ \\i,p)} 



k 

argmmz[p]i — \\z^-' \\f + '^^^(;x[W, ti.- )+ < y !x[,w , n- ' }, z.^-^ - n- ■ >ij 

1=0 

r?H(fc + l)(fc + 2)„ r,iT, 



4 

k 

argmin^,„{i||zW-n(-^E^WA(W^,^]''b)||? 



La ^ 2 



,nik+m+2)^^^,,T ^^^^ 
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which is also a proximity operator problem and can be efficiently solved in a similar way to 

m- 

By replacing the sentences 5 and 6 in Algorithm^ with (|56p and (j57p . respectively, eq. 
(|53p can be solved by modifying Algorithm^ and the new algorithm converges at the rate of 
O(p-). We leave the proof to future work due to the limit of space. Empirical results show 
that the smoothing method for MahNMP-GS" converges rapidly. MahNHF-GS* is useful in 
many problems especially in multi-view learning. We will evaluate its effectiveness in the 
following section. 

4.4 Elastic Net Inducing MahNMF 

MahNMF decomposes a given non-negative matrix into a non-negative low-rank part and 
a sparse part. However, if there is a group of nonzero variables in the sparse part which 
are highly correlated, MahNMF tends to select only one variable from the group regardless 
which one is selected. Th at is because Mah NMF introduces the sparity ove r the sparse part 
in a same way as Lasso (Tibshirani, 19961 ). In contrast, Zou and Hastie ( Zou and Hastie . 



20051 ) proposed an elastic net method to take into account the grouping effect of variables in 
regression. Elastic net minimizes the least squares loss function combined with both li norm 
and I2 norm over the coefficients and thus selects groups of correlated variables. Here, we 
introduce the main idea of Elastic net into MahNMF to take its advantage. In particular, 
we expect the highly correlated nonzero variables in the sparse part to be grouped by 
minimizing both the Manhattan distance and Euclidean distance between a non-negative 
matrix X and its non-negative low-rank approximation W^H simultanously. We termed 
this extension elastic net inducing MahNMF (MahNMF-£^iV) whose objective function is 

unn &^\W,H) = \\X - W^H\\m + ^\\X - W^Hfp, (58) 

where a > balances the Manhattan distance and the Euclidean distance between X and 
W^H. 

Since f^'^"'\W,H) is composed of a smooth part and a non-smooth part, the Nes- 
terov smoothing method can be naturally applied to optimizing (j58p . Since 
is non-convex, we solve (I58p by recursively optimizing W and H until convergence. Given 
W, H is updated by solving the elastic net inducing non-negative least absolute devia- 
tion (NL AD - £'A^) problem and W can be updated similarly with H fixed. According to 
(|Guan et al.l . boid l. the second term f ||X - W^H\\l of ([58]) is convex and its gradient is 



Lipschitz continuous with constant a||VFH^^||2, wherein || • || signifies the matrix spectral 
norm. Therefore, the proposed Algorithm [2] can be applied to optimizing NLAD-ii'A'^ by 
replacing the gradient and Lipschitz constant with V f^"\hk) = ^fx{hk) + aW{W'^hk — x) 
and L^^"^ = L\ + a\\WW'^\\2-, respectively. That is to say, the proposed Nesterov smooth- 
ing method, i.e.. Algorithm [3l can be naturally adopted to solve MahNMF- E'A'^ without 
increasing the time complexity. 

In MahNMF- ii^A^, the trade-off parameter plays a critical role to control the grouping 
effect of the s parse part. This param eter can be carefully selected based on the strategy 
introduced in (|Zou and Hastiel . I2OO5I I. 
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4.5 Symmetric MahNMF 

In spectral clustering, the data matrix X is the Laplacian matrix or normalized Laplacian 
matrix of the specified adjacent graph. In these cases, the data matrix is symmetric and 
it is reasonable to cut down the number of variables by assuming W = H. Inspired by 
spectral clustering, we extend MahNMF to symmetric MahNMF (or MahNMF-5KM for 
short), i.e., 

mm\\X - HH'^Wm, (59) 

where H £ R"^'' and r <^n. 

Since (|59p is neither convex nor smooth, it is an involved problem. Fortunately, the 



proposed RRI method can be applied to successively update each variable of in a closed 
form solution. In particular, eq. ()59p can be equivalently rewritten as X ~ H^^^H^-^^ + 
• • • + . To optimize each column H^^^ of H, wherein c G {1, r}, we fix the other 

columns and solve the following problem 

min \\Z - H^^^H^^'^^Wm, (60) 
>o 

where Z = X — '^^^^H^^^ H^^^ is the residual matrix. Moreover, Eq. ()60p can be solved by 
successively updating each variable. Considering variable ^^(j,c) with other variables fixed, 
we have 



min I Zyjj - Hfj^^^ I + X] I ^(^.i) ~ ^{^,c)H(^j^c) I ■ (61) 

Assuming H'^j ^.^ > dSl]) can be rewritten in a similar form to (f37|l . and thus 

it can be updated in a closed form solution by using Theorem 3. The inequality H'^- > 

Zq-j) means that "^21 -^fj i) — -^{jj)^ which can be easily satisfied by normalizing X. RRI 
converges fast because the variables are updated in a closed form solution. 

MahNMF-^YM is useful in practice especially for spectral clustering. In the following 
section, we will show that MahNMF-S'FM can be successfully applied to image segmentation 
and discuss its relationship to normalized cut ()Shi and Malikl . l2nnnl ). 



5. Experimental Results 

In this section, we first compare the efficiency of the rank one residual iteration (RRI) 
method with that of the Nesterov smoothing method for optimizing MahNMF on both 
synthetic and real- world datasets. Subsequently, we study the effectiveness and robustness 
of MahNMF by comparing it with EucNMF and KLNMF by conducting face recognition 
and clustering on both Yale B and PIE datasets. We conduct image segmentation with 
MahNMF-SYM to study its clustering effectiveness. We then study the sparse and low-rank 
decomposition capability of MahNMF by conducting background and illumination modeling 
on video sequences and challenging face images dataset and comparing it w ith both robust 



princ ipal component analysis (RFC A, (jCandes et al.l . l201ll )) and GoDec, (jZhou and Tad . 



l2niih . Finally, we apply the MahNMF- algorithm to multi-view learning on two chal- 
lenging datasets including VOC Pascal 07 and Mirflickr to show its effectiveness. 
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Figure 3: Objective values versus iteration numbers and CPU seconds on 100 x 50-D (a &: 
b) and 1000 x 500-D (c &: d) synthetic datasets. 



In this experiment, we use the multiplicative update rule ( Lee and Seung . 19991 ) to opti- 
mize KLNMF which stops when the indices of the column maximums of H do not cha, i iKe for 
40 consecutive iterations. We apply the efficient NMF solver NeNMF ( Guan et al. . 20121 ) 
to optimize EucNMF and use the projected gradient norm-based criterion as a stopping 
condition with the precision setting to 10~^. The MahNMF stops until the stopping condi- 
tion ([4]) is satisfied with precision setting to 0.1. The smoothness parameter in Algorithm 
[3] is initialized to Aq = .1 to guarantee fast convergence in the first steps and decrease 
dramatically to improve the approximation accuracy. 



5.1 RRI versus Nesterov's Smoothing Method 

As discussed above, both the rank-one residual iteration (RRI) method and Nesterov 
smoothing method, i.e., OGM, can be applied to the optimization of MahNMF and its 
extensions. One may expect to have to choose between them in practical applications. 
This section tries to address this issue by comparing their efficiency on both synthetic and 
real-world datasets. 

To study the scalabilities of both algorithms, we conducted them on 100 x 50-D and 
1000 X 500-D dense matrices, and set the reduced dimensionalities to 5 and 50, respectively. 
For fairness of comparison, both algorithms start from an identical randomly generated 
dense matrix. Since MahNMF is non-convex, the initial point has a high impact on the 
obtained solution. To filter this initialization impact, we repeat this experiment ten times 
and compare their average objective values and standard deviations versus iteration number 
and CPU seconds in Figure [3l 
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Figure 4: Objective values versus iteration numbers and CPU seconds on the Yale B (a & 
b) and PIE (c & d) datasets. 



Figure [3] (a) and (b) show that RRI converges in fewer iteration rounds and CPU seconds 
than OGM on the small scale data matrix. This is because a single iteration of RRI and 
OGM has a comparable time cost on a small scale data matrix while OGM obtains a closed 
form solution for each variable and further reduces the objective function. However, when 
the scale of the data matrix increases, OGM costs far fewer CPU seconds in each iteration 
round than RRI because it converges much more rapidly than RRI (see Figure [3] (c) and 
(d)). It confirms that OGM is more scalable than RRI. 



We also con ducted both RRI and OGM on two real- world datasets, i.e., Yale B (iGeorghiades et al 



200lh and PIE dSim et alLliooi ) face image datasets. The extended Yale B and PIE datasets 



contain 16, 128 and 41, 368 face images taken from 38 and 68 individuals, respectively. Each 
image is cropped to 32 x 32 pixels and reshaped to a long vector. In this experiment, we ran- 
domly select seven images of each individual and construct a 1024 x 266-dimensional matrix 
and a 1024 x 476-dimensional data matrix, respectively, for MahNMF learning. Similar to 
the above experiment, we set the reduced dimensionality to 5 and 50, respectively. Figure 
m gives their objective values and standard deviations versus iteration number and CPU 
seconds. From Figure HI we have the same observations as those obtained from Figure [3j 

In summary, when the scale of the data matrix and reduced dimensionality are ordinarily 
small we suggest optimizing MahNMF by using RRI. When the scale of the data matrix 
and reduced dimensionality are relatively large we suggest optimizing MahNMF by using 
Nesterov's smoothing method to take its advantage of scalability. 
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5.2 Face Recognition 



We study the data representation capacity of MaiiNMF by conducti ng face recognition 



expe riments on t wo challenging f ace image datasets including Yale B (jGeorghiades et al 



200lh and PIE dSim et alJ . liooi ') and making a comparison with two traditional NMF 



algorithms including EucNMF and KLNMF. We randomly select seven images of each 
individual to construct the training set Xtrain and the remaining images make up the test 
set Xfest- To eliminate the effectiveness of random selection, we repeat this trial ten times 
and report the average accuracy and standard deviation. All the NMF algorithms are used 
to factorize the training set into the product of basis W and the compact representation 

Htrnin, i.e., Xtrnin ~ Hh 



strain ■ 



Algorithm 4 Traditional NMF-based face recognition 

1: Compute Htrain = W'^'' Xtrain, Htest = W'^'^ Xtest- 
2: For i = 1,2, ...,ntest 

3 : Compute j = arg miu; { 1 1 fff^^^ - H^'J^^ 1 1 , J . 

4: Transfer xf;!^ 's label to X« , . 
5: End For 



The traditional NMF-based face recognition method obtains the representations of the 
test set by projecting Xtest onto the learned space as Htest = W'^^Xtest, wherein W'^^ is 
the pseudo inverse of W'^ . Algorithm^^summavizes this method which transfers the label of 
Euclidean distance-based nearest neighbor (NN) in the training set to the given test sample. 
Although this method is efficient and easy to implement, the pseudo inverse operator may 
bring in negative elements and thus it is not robust to outliers. 



To overcome the aforementioned drawback, Sandler and Lindenbaum ([Sandier and Lindenbaum 



201 ll ) suggested another NMF-based face recognition method for the corrupted training and 



test set (see Algorithm\S^. This method finds the best compact representation of the test 
sample on the learned basis and transfers the label of cosine distance-based nearest neighbor 
in the training set to the given test sample. In Algorithm\5l D(-, •) is determined based on 
the NMF algorithm used, e.g. Manhattan distance for MahNMF. The accuracy is calculated 
as the percentage of test samples that are correctly classified. 



Algorithm 5 Sandler-Lindenbaum's NMF-based face recognition 

1: Compute Htest = avgmmH^^^_^>oD{Xtest,W'^ Htest- 
2: For i = 1,2, ...,ntest 

3: Compute ?' = argmin,||TT7 — "'fr'nr^""* n |- 

4: Transfer X^^Js label to X^,. 
5: End For 



To evaluate the robustness of MahNMF, we add five types of outliers including occlusion, 
Laplace noise. Salt & Pepper noise, Gaussian noise and Poisson noise to the training set. 
In the classification stage, we conduct Algorithm |3] and Algorithm [5] on the clean and 
contaminated test sets, respectively, to evaluate the robustness of MahNMF under different 
settings. The experimental results of MahNMF under both settings are encouraging. 
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Figure 5: Face recognition accuracy versus dimensionalities of PCA, EucNMF, KLNMF, 
and MahNMF on the Yale B dataset when the training set is contaminated by 
occlusion, Laplace noise, salt & pepper noise, Gaussian noise, and Poisson noise. 
All the algorithms were evaluated under two settings: the test set is clean (a-f) 
and the test set is contaminated by the same noise as the training set (g-1). 



5.2.1 Yale B Dataset 



The extended Yale face database B ( Georghiades et aD . boOll ) contains 16, 128 images of 



38 individuals under 9 poses and 64 illumination conditions. All the images are manually 
aligned and cropped to 32 x 32 pixels. We simply selected around 64 near frontal images un- 
der different illuminations per individual and reshaped each image into an 1024-dimensional 
long vector. We conducted MahNMF, EucNMF and KLNMF on the training set contam- 
inated by five types of outlie rs with reduced dimensionalities varying from 10 to 150. The 
eigenface obtained by PCA (iHotellind . \l93^ ) was used as a baseline. Figure [S] gives their 



average accuracies and standard deviations. 

Figure [5] (a) and (g) show that MahNMF outperforms both EucNMF and KLNMF on 
the Yale B dataset because it is robust to outliers caused by illuminations and shadows. 
To further study the robustness of MahNMF representation, Figure [5] (c), (d), (i) and (j) 
show that MahNMF significantly outperforms both EucNMF and KLNMF on the training 
set contaminated by Laplace noise and Salt & Pepper noise because MahNMF success- 
fully models such heavy-tailed noises. On the other hand, Figure [5] (e) and (k) show that 
MahNMF does not perform well when the training set is contaminated by Gaussian noise 
because it violates the assumption of MahNMF. Similarly, Figure [5] (f) and (1) show that 
MahNMF is comparable to KLNMF when the training set is contaminated by Poisson noise. 



25 



N. GuAN, D. Tao, Z. Luo, and J. Shawe- Taylor 



Figure [5] (b) and (h) show that MahNMF performs robustly in presence of occlusion because 
it successfully suppresses the outliers. 




(a) (b) W W) 



Figure 6: Face image examples (column a) of Yale B dataset and the learned basis by 
KLNMF (column b), EucNMF (column c), and MahNMF (column d) in the 
absence of noise (1st row) and in the presence of occlusions (2nd row), additive 
Laplace noise (3rd row), Salt & Pepper noise (4th row), Gaussian noise (5th row), 
and multiplicative Poisson noise (6th row). 

To further study the effectiveness of MahNMF in data representation, we randomly 
selected five base vectors from the learned basis by different NMF algorithms when the 
reduced dimensionality is 50. Figure [6] compares the base vectors learned by MahNMF 
with those learned by KLNMF and EucNMF. The first four rows show that MahNMF 
successfully supresses the occlusion, Laplace noise, and Salt & Pepper noise while both 
KLNMF and EucNMF representations are contaminated. It is interesting that MahNMF 
also suppresses the Poisson noise which confirms the observation in Figure El That is 
because the Poisson distribution is also heavy-tailed to some extent. 



5.2.2 PIE Dataset 

The CMU PIE face image database ( Sim et al. . 20031 ) contains 41,368 images taken from 
68 individuals under 13 different poses, 43 different illumination conditions, and 4 different 
expressions. All the images have been aligned according to the eye position and cropped 
to 32 X 32 pixels. We simply selected 42 images per individual at Pose 27 under different 
light and illumination conditions and reshaped each image into an 1024-dimensional long 
vector. Figure [7] gives the average face recognition accuracies and standard deviations of 
PCA, EucNMF, KLNMF, and MahNMF representations. 

Figure [7] (a) and (g) shows that MahNMF outperforms EucNMF and its performance 
is comparable to KLNMF on the PIE dataset. Figures [7] (b) to (j) show that MahNMF 
significantly outperforms both EucNMF and KLNMF when the training set is contaminated 
by occlusion, Laplace noise, and Salt &: Pepper noise. Figure [7] (a), (d), (g) and (j) show 
that MahNMF performs almost perfectly on the PIE dataset even when both the training 
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Figure 7: Face recognition accuracy versus dimensionalities of PCA, EucNMF, KLNMF, 
and MahNMF on the PIE dataset when the training set is contaminated by 
occlusion, Laplace noise, salt & pepper noise, Gaussian noise, and Poisson noise. 
All the algorithms were evaluated under two settings: the test set is clean (a-f) 
and the test set is contaminated by the same noise as the training set (g-1). 



and test sets are seriously contaminated by Salt & Pepper noise. This is because the PIE 
face images are mainly contaminated by illumination which can be successfully removed by 
using the low-rank and sparse representation of MahNMF. For similar reasons, MahNMF 
outperforms both EucNMF and KLNMF even when the training set is contaminated by 
Gaussian and Poisson noises (see Figure [7] (e) to (1)). 

Figure [8] gives the randomly selected bases learned by EucNMF, KLNMF and MahNMF 
when the reduced dimensionality is 50. It shows that MahNMF successfully suppresses 
occlusion, Laplace noise. Salt & Pepper noise, and the illumination in the training set. 
Therefore, Figure M supports our observations in Figure [71 

One may be interested in the reconstruction capacity for the original images when they 
are contaminated by outliers, e.g., occlusion and noises. Figure [9] gives the average recon- 
struction of the face images in both training and test sets of the Yale B and PIE datasets. It 
shows that MahNMF obtains clearer reconstruction than EucNMF and KLNMF. To further 
study MahNMF 's reconstruction capacity. Table [1] compares its relative error with the errors 
obtained by EucNMF and KLNMF on both Yale B and PIE datasets. The relative error 

is defined as ^^-^^^72^, wherein X and X' are the original image and reconstructed image, 
respectively. Here we only compare the relative errors when the reduced dimensionality is 
80. For other reduced dimensionalities, we make similar observations as shown in Table 
dJ Table [1] shows that MahNMF reconstructs the face images better in both the training 
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Figure 8: Face image examples (column a) of PIE dataset and the learned basis by KLNMF 
(column b), EucNMF (column c), and MahNMF (column d) in the absence of 
noise (1st row) and in the presence of occlusions (2nd row), additive Laplace 
noise (3rd row), Salt & Pepper noise (4th row), Gaussian noise (5th row), and 
multiplicative Poisson noise (6th row). 



Table 1: The relative errors of the reconstructions by EucNMF, KLNMF, and MahNMF 
on both Yale B and PIE datasets. 



Algorithm 


EucNMF 


KLNMF 


MahNMF 


EucNMF 


KLNMF 


MahNMF 


Yale B Dataset 


Training Set 


Test Set 


No noise 


.023 ± .001 


.032 ± .001 


.039 ± 


.002 


.055 ± 


.002 


.058 ± 


.002 


.060 ± 


.002 


Occlusion 


.080 ± .004 


.0117 ± .005 


.089 ± 


.004 


.095 ± 


.002 


.120 ± 


.002 


.097 ± 


.002 


Laplace noise 


.155 ± .005 


.157 ± .004 


.127 ± 


.005 


.140 ± 


.003 


.146 ± 


.003 


.122 ± 


.002 


Salt & Pepper 


.245 ± .010 


.228 ± .013 


.082 ± 


.008 


.133 ± 


.003 


.134 ± 


.004 


.079 ± 


.003 


Gaussian noise 


.241 ± .005 


.263 ± .005 


.246 ± 


.004 


.185 ± 


.004 


.194 ± 


.004 


.183 ± 


.004 


Poisson noise 


.028 ± .001 


.034 ± .002 


.044 ± 


.002 


.059 ± 


.002 


.060 ± 


.002 


.062 ± 


.002 


PIE Dataset 


Training Set 


Test Set 


No noise 


.010 ± .000 


.009 ± .000 


.012 ± 


.001 


.012 ± 


.000 


.012 ± 


.000 


.014 ± 


.000 


Occlusion 


.075 ± .002 


.113 ± .003 


.063 ± 


.002 


.070 ± 


.001 


.106 ± 


.001 


.064 ± 


.001 


Laplace noise 


.072 ± .002 


.079 ± .003 


.046 ± 


.001 


.056 ± 


.001 


.054 ± 


.001 


.044 ± 


.001 


Salt & Pepper 


.097 ± .004 


.115 ±.003 


.021 ± 


.003 


.052 ± 


.001 


.049 ± 


.001 


.020 ± 


.001 


Gaussian noise 


.024 ± .001 


.028 ± .001 


.024 ± 


.001 


.023 ± 


.000 


.024 ± 


.000 


.023 ± 


.000 


Poisson noise 


.013 ± .000 


.011 ± .000 


.012 ± 


.000 


.014 ± 


.000 


.013 ± 


.000 


.014 ± 


.000 



and test sets when the training set is contaminated by Laplace and Salt & Pepper noises. 
Therefore, MahNMF successfully handles the heavy-tailed noise and performs robustly in 
the presence of outliers. 
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Figure 9: Average reconstructed face images of KLNMF (1st and 4th row), EucNMF (2nd 
and 5th row), and MahNMF (3rd and 6th row) on the Yale B (lst-3rd row) and 
ORL (4th-6th row) datasets in absence of noise (column a) and in presence of 
occlusions (column b), additive Laplace noise (column c), Salt & Pepper noise 
(column d), Gaussian noise (column e), and multiplicative Poisson noise (column 
f). The left and right hand images are the averaged reconstructed images on the 
training set and test set, respectively. 



5.3 Image Clustering Study 

In this section, we first conduct a simple clustering experiment on face image datasets in 
the presence of different types of outliers to show its effectiveness in data representation. 
We then conduct image segmentation experiments to evaluate the clustering effectiveness 
of MahNMF- 5 KM, si nce the segmentation problem is intrinsically a clustering problem 
(|Wu and Leahvl . Il993l l. 



5.3.1 Face Image Datasets 

To evaluate the effectiveness and robustness of MahNMF in data representation, we conduct 
the clustering experiments on both Yale B and ORL datasets. We randomly select 4 to 36 
individuals from the Yale B dataset and 10 to 68 individuals from the ORL dataset to con- 
struct the test set X £ R^"^^^^, wherein N signifies the size of the test set. By factorizing 
the reweighted X with EucNMF, KLNMF, and MahNMF, we evaluate th eir clusteri r ig per - 
formance in terms of both accuracy (AC) and mutual information (MI) (jXu et al.1 . [2OO3I ) . 



To eliminate the randomness of individual selection, we repeat this trial 20 times and report 
the average AC and MI in Figures [TO] and [TTl 

Figures [ini and E] show that MahNMF outperforms both EucNMF and KLNMF on the 
Yale B and PIE datasets even though the test set is contaminated by occlusion, Laplace 
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Figure 10: Image clustering accuracy (a-f ) and mutual information (g-1) versus cluster num- 
ber of EucNMF, KLNMF, and MahNMF on the Yale B dataset in absence of 
noise (a &: g) and in presence of occlusion (b & h), additive Laplace (c &: i), Salt 
&: Pepper (d &; j), Gaussian noise (e & k), and Poisson noise (f & 1). 



noise, and salt & pepper noise. Figure [TT] (e) and (k) show that MahNMF performs better 
than EucNMF on the PIE dataset in the presence of Gaussian noise because this dataset 
contains serious illumination outliers and MahNMF can robustly recover the sparse and low 
rank representation while EucNMF cannot. 



5.3.2 Image Segmentation 



Spectral clustering methods such as normalized cuts (Ncuts, (jShi and Maiikl . l2nnnh ;i have 
been successfully used in image segmentation. They usually decompose the (normalized) 
Laplacian matrices by using Eigen decomposition and partition the pixels of image into two 
parts based on the second eigenvector. By recursi vely partitio n ing p ixels, Ncuts successfully 
segment a given image. Recently, Ding et al. ( Ding et al. . 20061 ) have proved that the 
symmetric EucNMF is equivalent to spectral clustering. We study the effectiveness of our 
MahNMF- 5 algorithm in image segmentation by comparing it with Ncuts. 

In th is experiment, we com pare MahNMF- 5 and Ncuts on the Berkeley segmentation 



dataset (jMartin et al.l . l200ll ). For a given image, we construct a graph G, wherein each 



node corresponding to a pixel and the edge between node i and node j has the weight Wij 
defined as the product of a feature similarity term and spatial proximity term. Following 
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Figure 11: Image clustering accuracy (a-f) and mutual information (g-1) versus cluster num- 
ber of EucNMF, KLNMF, and MahNMF on the PIE dataset in absence of noise 
(a & g) and in presence of occlusion (b &: h), additive Laplace (c & i), Salt & 
Pepper (d & j), Gaussian noise (e &: k), and Poisson noise (f & 1). 



(|Shi and Malikl . l200d l. we set 



?2 

Wi^ = e FX 



< r , (62) 
0, otherwise 



where dF(^^j-^ = \\F{i) — F{j)\\i^ and F{i) denotes the brightness of node i, dL^^ j-^ = \\L{i) — 
L{j)\\i^ and L{i) denote the spatial location of node i. The parameter r is used to suppress 
the correlation between two pixels that are relatively far from each other. Ncuts partition G 
based on the second eigenvector of the normalized Laplacian matrix L = I — D~^W D~'2 , 
wherein D is a diagonal matrix with Da = 'Y^^=i ^ij ^^id A'^ is the total pixel number. 



1 



Similarly, we factorize the normalized similarity matrix D~2WD~2 HH^ by using the 
proposed MahNMF-S'FM algorithm followed by clustering H with K-means to obtain labels 
for all the segments. The reduced dimensionality is set to 5 in MahNMF-S'FM and the 
segments number is set to 3 and 5, respectively. Figures [T2 t 1131 and 1141 give segmentation 
results for ten example images. 

Figures [T2 t 1131 and [14] show that MahNMP-^YM-l- K-means successfully separates the 
objects in these example images and its performance is mostly comparable with Ncuts. In 
some cases, e.g., the second rows in Figure [T2l 1131 and 1141 MahNMF-S'FM-l-K-means out- 
performs Ncuts. Figure [TSl shows the normalized similarity matrices, i.e., D~2WD~2^ for 
four example images and their low-rank approximations, i.e., HH^ . From Figure[T5](a) and 
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Figure 12: Three segments of four example images (a) by using MahNMF-^FM+K-means 
(b to d) and Ncuts (e to g). Each row corresponds to an image. 




Figure 13: Three segments of another two example images (a) by using MahNMF-S'KM+K- 
means (b to d) and Ncuts (e to g). Each row corresponds to an image. 



(b), we can find that MahNMF-S'yM clearly distinguishes the three classes of correlations 
between pixels of images of the first and second rows of Figure [T2l Therefore, low-rank 
representation helps K-means to correctly segment these images into three parts. Simi- 
larly, according to Figure [TCI (c) and (d), MahNMF-^ YM successfully finds the correlations 
between pixels of images of the second and last rows of Figure [T^ and thus the learned 
low-rank representation helps K-means to segment these images into five parts. It means 
that the MahNMF-^YM+K-means method performs well for image segmentation. 
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(9) (h) (i) (i! (It) 




Figure 14: Five segments of four example images (a) by using MahNMF- 5 KM -|-K- means 
(b to f) and Ncuts (g to k). Each row corresponds to an image. 



Note that the parameters 6f and (^l in ([62]) should be carefully selected ( Nascimento and Carvalhol . 

201 ll ). In this experiment, we normalized both dF and dL to [0, 1], and simply set 5f = .3 
and 6l = .7. Another critical parameter r was set to r = med{dL). Because this experiment 
aims to study the effectiveness of MahNMF-S'YM in image segmentation, we deduce tuning 
these parameters to future work. 



5.4 Sparse and Low-rank Decomposition 

In this section, we conduct background subtraction and shadow/illumination removal ex- 
periments to evaluate the sparse and low-rank decomposition c apability of MahNM F by 
compari ng it with robust prin cipal component analysis (RPCA0, (jCandes et al.l . l201lh ) and 
GoDed dZhou and Taol . 1201 ih . 



5.4.1 Background Subtraction 

In video surveillance, background modeling is a ch allenging task that models the back- 
ground and detects moving objects in the foreground ( Cheng et al. . 201 ll ). The background 
variation can be approximated by low-rank representation because video frames may share 
the same background. Moreover, the foreground objects, such as walkers or cars, occupy 
only few image pixels and thus can be considered as sparse noise. As discussed in Section 



1. http:/ /perception. csl.umc.edu/matrix-rank/sample_code. html 

2. https:/ /sites. google.com/site/godecomposition/code 
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Figure 15: Five segments of four example images (a) by using MahNMF-^yM+K-means 
(b to f) and Ncuts (g to k). Each row corresponds to an image. 
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Figure 16: Video frame (column a), background and foreground subtracted by RPCA (col- 
umn b and c), GoDec (column d and e), and MahNMF (column f and g) on 
'Hair (first two rows) and 'Lobby' (last two rows) video sequences. 



I, MahNMF naturally reveals the low-rank approximation of background and stores the 
foreground moving objects in noise. In this experiment, we evaluate its capability in back- 
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Figure 17: Video frame (column a), background and foreground subtracted by RPCA (col- 
umn b and c), GoDec (column d and e), and MahNMF (column f and g) on 
'Bootstrap' (first two rows) and 'Shopping Mall' (last two rows) video sequences. 



ground subtraction on four surveillance video^ (jLi et al. . 20041 ) including 'Hall', 'Lobby', 
'Bootstrap', and 'Shopping Mall', who se resolutions are 14 4 x 1 76, 128 x 160, 120 x 160, 
and 256 x 320, respectively. Similar to (jCandes et al. . 2011 ) and ( Zhou and Tao . 201 ll ). we 
select 200 frames from each video and resize each frame to a long vector to construct the 
data matrix X. We set the low rank of MahNMF to 2 and keep the parameter settings 
of RPCA and GoDec consistent with those given in their demos. Figures [16] and [17] give 
the background and foreground of the first two and last two videos (2 frames per video). 
These figures show that MahNMF successfully separates the background and foreground 
without losing detail. The results obtained by MahNMF are comparable to those obtained 
by RPCA and GoDec. 



5.4.2 Shadow/Illumination Removal 

According to the face recognition results in Section V.B, both shadow and illumination pull 
down the image quality and thus reduce the accuracy of many methods such as principal 
component analysis (PCA, ( Hotelling . 19331 )). However, MahNMF performs well on both 
Yale B and PIE datasets which are seriously contaminated by shadow and illumination. 
That is because MahNMF robustly estimates the low-rank representation of the face images. 
In this experiment, we further study the capability of MahNMF in shadow/illumination 
removal by comparing with RPCA and GoDec. 

We conducted MahNMF, RPCA, and GoDec on the images taken from single individual 
of the extended Yale B dataselQ which contains around 64 frontal face images taken from 



3. http:/ /perception. i2r.a-star.edu.sg/bk_model/bk_index.html 

4. http:/ /cvc. yale.edu/projects/yalefacesB/yalefacesB. html 
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Figure 18: The shadow/illumination removal results obtained by RPCA (X = A + E), 
GoDec {X = L + S + E), and MahNMF (X « W'^H) on face images of the first 
two individuals taken from the extended Yale B dataset, where the top two rows 
come from the first individual and the bottom two rows come from the second 
individual. 



38 individuals. By reshaping the 192 x 168 pixels of each image into a long vector, we got 
a 32764 x 64-dimensional data matrix. We set the low rank of MahNMF and GoDec to 3 
and kept other settings consistent with those in the background subtraction experiment. 
Figures [18] and [19] give the shadow/illumination removal results for four individuals. It 
shows that MahNMF successfully removes both shadow and illumination of these images. 
Such observation confirms those results obtained from Section V.B. From Figures [18] and 
\T9\ we can see that the performance of MahNMF is comparable to RPCA and GoDec. 



5.5 Multi-View Learning 

In several computer vision tasks, data sets often inherently involve many types of features 
such as pixels, gradient-based features, color-based features, and surrounding text, which 
represent the same image from different views. Many computer vision tasks such as image 
retrieval and image annotation have proven to be beneficial from these multiple views. 
One of the most important problems is how to learn a latent representation of t he data to 



leverage the information shared by the multiple views. To this end, Jia et al. (|Jia et al 



ioiol i proposed factorized space with a structured sparsity (FLSS) algorithm that learns a 
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Figure 19: The shadow/illumination removal results obtained by RPCA {X = A + E), 
GoDec {X = L + S + E), and MahNMF (X « W'^H) on face images of the first 
two individuals taken from the extended Yale B dataset, where the top two rows 
come from the third individual and the bottom two rows come from the fourth 
individual. 



latent space to factorize the information contained in multiple views into shared and private 
parts. Given a data set X G R^^^^ , where the features are composed of V views, FLSS 
learns to find the dictionary D and the coefficients H by optimizing 



1 

min — 

D,H N 



\X - D'^HWl + A ^ \\Dl^^ ||i,oo + tII^IIi, 



(63) 



where A > and 7 > are the weights of the gro up sparsity over D and H, respectively, 
and is the index for the v-ih view. According to ( Jia et al. . 2O10l ). the pararneters in ([63]) 
were s et to A = .01 and 7 = .01 in this experiment. In contrast to ( Jia et al. . 20101 ) . Kim 
et al. dKim et al.1 . l2012l l proposed a group sparse NMF to learn a low-rank representation 
of the multi-view data by incorporating the li^p norm over the basis matrix into EucNMF's 
loss function, i.e.. 



min —\\X 
W>0,H>G 2 



a 



V 

El 



[pv] 



\l,P + o ll-f^l 



2 



(64) 



where p = 2, 00 and a and (3 are the weights of the group sparsity of W and Tikhonove 
regularization over iJ, respectively. Here we term this algorithm EucNMF-G^ and set 
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its parameters as a 



10 



-2 



and /3 = 10 ^ according to ( Kim et al. . 20121 ). Although 



EucNMF-GS' achieves great success in multihngual text analysis, it is not robust in computer 
vision applications because the Euclidean distance-b ased model fails to ha ndle the heavy- 
tailed noise contained in some features such as SIFT (jjia and Darrelll201lh . The proposed 
MahNMF-GS' overcomes this problem and performs robustly in multi-view learning. In this 
experiment, we use the MahNMF- GS' model defined in ()40p with the parameters 7vy setting 
to 1 % of the g r oup s parsity over initial W^''\ wherein p G {1,...,V}. To keep consistent 
with (jjia et al.l . boid ). we set p = oo in both ^ and ([to|) . 




Figure 20: The mean average precisions (mAP) and sparsity patterns of the latent repre- 
sentation learned by MahNMF-G5 and EucNMF-G^ on both VoC Pascal 07 
(the 1st row) and Mir Flickr (the 2nd row) datasets. 



We evaluate the effectiveness of MahNMF-G^" in multi-view learning b y comparing it 



with FLSS and Euc NMF- GS on two challeng ing datasets, i.e., VOC Pascal 07 ([Everingham et al 



20071 ) and Mirflickr teuiskes and Lewi , [iooi i , which contain 10, 000 and 25, 000 natural im 



ages collected from 20 and 38 classes of objects, respectively. Both datasets contain sixteen 
types of features from which we select two types of gradient-based features including 1000- 
dimensional "DenseSift" and 512-dimensional "Gist" and one type of color related feature, 
i.e., 100-dimensional "DenseHue", in training. The training set is constructed by select- 
ing half the images and the remaining images make up the test set. We vary the reduced 
dimensionality from 100 to 1,000 for each dataset and obtain the latent spaces by using 
MahNMF-G^, FLSS, and EucNMF- GS'. In the test stage, both the training and test sam- 
ples are projected into the latent space and a SVM classifier for each class of object is 
constructed. Since this experiment aims to compare the different latent spaces learned by 



38 



MahNMF: Manhattan Non-negative Matrix Factorization 



MahNMF-G^, FLSS, and EucNMF-GS, all SVM classifiers use linear kernel. Based on the 
constructed classifiers, average precision (AP) is calculated for each class and the mean 
average precision (mAP) is calculated for evaluation. 



Table 2: Best mAP Values and the Corresponding Reduced Dimensionalities of MahNMF- 
GS, EucNMF-GS', and FLSS on Both VOC Pascal07 and Mir Flickr Datasets. 



Algorithm 


MahNMF- GS* 


EucNMF-G^ 


FLSS 


Dataset 
VOC Pascal 07 
Mir Flickr 


mAP rDim 
39.76% 800 
41.69% 1000 


mAP rDim 
35.29% 300 
36.89% 800 


mAP rDim 
32.15% 143 
32.04% 106 


rDim: reduced c 


imensionality. 



Figure [20l presents the mAP versus dimensionalities for MahNMF- and EucNMF- G5. 
It shows that MahNMF- GS' significantly outperforms EucNMF- GS* on both datasets. To 
study the latent spaces learned by different algorithms, we presented the average sparseness 
of each colum n of th e learn ed basis matrices in the second and third columns of Figure [20j 
According to ( Hover . 20041 ) . the sparseness of an A^-dimensional vector x is defined as 



SPR{x) 



N 



X 



Vn-1 



(65) 



According to (j65p . the fewer non-zeros the vector contains, the larger its sparseness. In 
Figures [20] (b), (c), (e) and (f), the x-axis denotes the feature dimensionalities concate- 
nating all the views and the y-axis denotes the average sparseness over different reduced 
dimensionalities. Figures [20] (b) and (e) show that the sparsity pattern of the basis learned 
by MahNMF- G^ for each view is similar, while the sparsity patterns for different views 
are different from one another. This implies that MahNMF- GS' successfully learns the 
private information for different views and thus works well in multi-view learning while 
EucNMF- GS does not work well. Table [2] shows the best mAP values and the correspond- 
ing dimensionalities of the latent subspaces learned by MahNMF- GS", EucNMF- GS" and 
FLSS, respectively. It shows that MahNMF- GS* outperforms FLSS on both VOC Pascal 07 
and Mir Flickr datasets because FLSS is not designed for classification. 



6. Conclusion 

This paper presents a general MahNMF framework to model the heavy-tailed Laplacian 
noise by minimizing the Manhattan distance between a data matrix and its low-rank ap- 
proximation. Compared to traditional NMF, MahNMF is much more robust to outliers 
including both occlusions and several types of noises, and thus performs well in both classi- 
fication and clustering. Since MahNMF naturally takes into account prior knowledge about 
the underlying low-rank structure of data and sparse structure of noise, it robustly recovers 
the low-rank and sparse parts of a non-negative matrix, and its performance is comparable 
to robust principal component analysis (RPCA) and GoDec in background subtraction and 
illumination/shadow modeling. While RPCA and GoDec are suitable for matrix comple- 
tion, MahNMF is well-suited for data representation with the non-negativity property of 
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data kept. The MahNMF problem is difficult to solve because the objective function is 
neither convex nor smooth. This paper proposes two fast optimization methods including 
rank-one residual iteration (RRI) and Nesterov's smoothing method for optimizing Mah- 
NMF. RRI successively updates each variable in MahNMF in a closed form solution and 
thus converges fast. However, its time complexity is high which makes RRI unsuitable for 
scaling to large scale matrices. The proposed Nesterov smoothing method overcomes this 
deficiency by optimizing MahNMF with an optimal gradient method on a smartly smoothed 
approximation function. By setting the smoothness parameter inversely proportional to the 
iteration number, the Nesterov smoothing method iteratively improves approximation ac- 
curacy and converges to an approximate solution of MahNMF. Under the MahNMF frame- 
work, we develop box constrained MahNMF, manifold regularized MahNMF, group sparse 
MahNMF, and elastic net inducing MahNMF and apply the proposed RRI and Nesterov's 
smoothing method to optimize them. Inspired by spectral clustering, we further develop 
symmetric MahNMF for image segmentation and discussed its equivalence to normalized 
cuts (Ncuts). Experimental results on several computer vision problems show that these 
MahNMF variants are comparable to traditional methods. 
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